Improved Concentration Bounds for Count-Sketch 

Gregory T. Minton Eric Price 

MIT MIT 



Abstract 

We present a refined analysis of the classic Count-Sketch streaming heavy hitters algorithm
[CCF02]. Count-Sketch uses $O(k \log n)$ linear measurements of a vector $x \in \mathbb{R}^n$ to
give an estimate $\hat{x}$ of $x$. The standard analysis shows that this estimate $\hat{x}$
satisfies $\|\hat{x} - x\|_\infty^2 < \|x_{\overline{[k]}}\|_2^2 / k$, where $x_{\overline{[k]}}$
is the vector containing all but the largest $k$ coordinates of $x$. Our main result is that most
of the coordinates of $\hat{x}$ have substantially less error than this upper bound; namely, for
any $c < O(\log n)$, we show that each coordinate $i$ satisfies

\[ (\hat{x}_i - x_i)^2 < \frac{c}{\log n} \cdot \frac{\|x_{\overline{[k]}}\|_2^2}{k} \]

with probability $1 - 2^{-c}$, as long as the hash functions are fully independent. This subsumes
the previous bound. Our improved point estimates also give better results for $\ell_2$ recovery of
exactly $k$-sparse estimates $x^*$ when $x$ is drawn from a distribution with suitable decay, such
as a power law.

Our proof shows that any random variable with positive real Fourier transform and finite
variance concentrates around 0 at least as well as a Gaussian. This result, which may be of
independent interest, gives good concentration even when the noise does not converge to a
Gaussian.



1 Introduction 

The heavy hitters problem and the closely related sparse recovery problem are two of the most
fundamental problems in the field of sketching and streaming algorithms [GI10, CH10, Mut05].
The goal is to efficiently identify and estimate the $k$ largest coordinates of an $n$-dimensional
vector $x$ using a linear sketch $Ax$, where $A \in \mathbb{R}^{m \times n}$ has
$m = O(k \log^c n)$ rows. The strongest commonly used formal guarantee for the quality of such an
estimate is the $\ell_\infty/\ell_2$ guarantee: this is a bound for the estimate $\hat{x}$
recovered from $Ax$ which is of the form

\[ \|\hat{x} - x\|_\infty^2 < \|x_{\overline{[k]}}\|_2^2 / k \tag{1} \]

where $x_{\overline{[k]}}$ denotes the vector obtained from $x$ by replacing its largest $k$
coordinates with 0.

The classic approach for this problem is the Count-Sketch algorithm of Charikar et al. [CCF02],
which uses $m = O(k \log n)$ measurements and satisfies (1) with $1 - 1/n^{\Theta(1)}$
probability. It is simple, practical, and gives the best known theoretical performance in many
settings. This paper shows that the quality of the approximation $\hat{x}$ given by Count-Sketch
is much better than the standard bound (1) suggests. While (1) gives a bound on the worst-case
error of $\hat{x}$, we prove that most coordinates of $\hat{x}$ have a $1/\sqrt{\log n}$ factor
less error than this worst case.



The Count-Sketch of a vector $x$ using $R$ rows of $C$ columns is defined as follows. For
$u \in [R]$, we choose hash functions $h_u : [n] \to [C]$ and $s_u : [n] \to \{\pm 1\}$. The
sketch is

\[ y_{u,v} = \sum_{i \,:\, h_u(i) = v} s_u(i)\, x_i, \]

which consists of $RC$ linear measurements. The estimate $\hat{x}$ is given by

\[ \hat{x}_i = \operatorname{median}_u \; s_u(i)\, y_{u, h_u(i)}. \]

Setting $C = O(k)$ and $R = O(\log n)$, [CCF02] proves that (1) holds with
$1 - 1/n^{\Theta(1)}$ probability. As per [CM06, GI10], the $\ell_\infty/\ell_2$ guarantee can be
converted into an $\ell_2/\ell_2$ guarantee: if $x^*$ contains the largest $2k$ coordinates of
$\hat{x}$, then

\[ \|x^* - x\|_2^2 \le O(\|x_{\overline{[k]}}\|_2^2). \]
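The definition above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the helper names `make_hashes`, `sketch`, and `estimate` are hypothetical, the hash functions are stored as explicit fully random tables, and the sizes are chosen only for the demonstration.

```python
import random
from statistics import median


def make_hashes(n, rows, cols, rng):
    """Draw fully random h_u : [n] -> [cols] and s_u : [n] -> {+1, -1} for each row u."""
    h = [[rng.randrange(cols) for _ in range(n)] for _ in range(rows)]
    s = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]
    return h, s


def sketch(x, h, s, cols):
    """Return the R x C table y with y[u][v] = sum of s_u(i) * x_i over i with h_u(i) = v."""
    y = [[0.0] * cols for _ in range(len(h))]
    for u in range(len(h)):
        for i, xi in enumerate(x):
            y[u][h[u][i]] += s[u][i] * xi
    return y


def estimate(y, h, s, i):
    """Point estimate of x_i: the median over rows u of s_u(i) * y[u][h_u(i)]."""
    return median(s[u][i] * y[u][h[u][i]] for u in range(len(h)))


rng = random.Random(0)          # fixed seed so the run is repeatable
n, rows, cols = 200, 9, 16      # illustrative sizes only
x = [0.0] * n
x[3], x[57] = 10.0, -7.0        # an exactly 2-sparse vector
h, s = make_hashes(n, rows, cols, rng)
y = sketch(x, h, s, cols)
x_hat = [estimate(y, h, s, i) for i in range(n)]
print(x_hat[3], x_hat[57])
```

Because the sketch is linear in $x$, the table for a sum of two vectors is the sum of their tables, which is what makes the structure usable in a streaming setting.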

Our main result is the following strengthening of the analysis in [CCF02] for the concentration of
the point estimates $\hat{x}_i$ resulting from Count-Sketch, assuming the hash functions are fully
random:

Main Result (Theorem 4.1). Consider the estimate $\hat{x}$ of $x$ from Count-Sketch using $R$ rows
and $k \ge 2$ columns, with fully random hash functions. For any $t \le R$ and each index $i$,

\[ \Pr\left[ (\hat{x}_i - x_i)^2 > \frac{t}{R} \cdot \frac{\|x_{\overline{[k]}}\|_2^2}{k} \right] < 2e^{-\Omega(t)}. \]



The standard analysis [CCF02] gives the special case of $t = R$. One gets (1) by setting
$t = R = O(\log n)$ and applying a union bound. Our result says that most coordinates of $\hat{x}$
have much better estimates than this worst-case bound. It gives no direct improvement to (1), but
this is expected: an improved $\ell_\infty/\ell_2$ bound would allow for improved $\epsilon$
dependence in $\ell_2/\ell_2$ recovery, which is impossible [PW11].

Although an improvement in $\ell_2$ reconstruction is impossible for general vectors $x$, for some
common distributions on $x$ it is possible. For example, if $x$ follows the power law
$x_i = i^{-\alpha}$ for some constant $\alpha > 0.5$, Theorem 4.1 implies that we can reconstruct
a $k$-sparse $x^*$ from $\hat{x}$ satisfying

\[ \|x^* - x\|_2^2 \le \left(1 + \frac{1}{\log n}\right) \|x_{\overline{[k]}}\|_2^2 \]

with constant probability (see Theorem 4.2). This beats the $1 + \epsilon$ approximation of
traditional analysis. Previous work [Pri11] combined the Count-Sketch with another sketch to get a
$(1 + \frac{1}{\sqrt{\log n}})$ factor approximation for this problem.

Our analysis requires that the hash functions be fully random. This is unfortunate because fully
random hash functions would take up more space than the sketch itself, but there are some reasons
why this constraint is not too problematic. One reason is that Nisan's pseudorandom number
generator [Nis92] lets us store the hash functions with only a $\log n$ factor increase in space.
Then if we wish to run Count-Sketch on multiple different vectors, we can reuse the hash
functions. A second reason is that one expects bounded independence to suffice as long as the
vector $x$ itself has sufficient entropy. A result of this form is known [MV08] when
$\mathrm{supp}(x)$ is drawn at random from a much larger domain. For example, if
$\mathrm{supp}(x)$ contains $n^{1/3}$ random coordinates out of $n$, then [MV08] implies
near-uniformity with 4-wise independence.

Our Techniques. Our basic strategy is to translate the problem of bounding Count-Sketch error
into a problem of proving a strong concentration result for a certain class of random variables. 
This, in turn, we solve by analyzing the Fourier transform of such variables. 

In more detail, the argument proceeds as follows. The error $\hat{x}_i - x_i$ is, by definition,
the median over rows of error terms coming from the different coordinates which hash to the same
column as $i$. For each row, we separate the error term into contributions from (i) the largest
$k$ coordinates $j \in [k]$ and (ii) the remaining coordinates $j \notin [k]$. The error of type
(i) is zero with constant probability, and we bound the error of type (ii) with our concentration
result. We then get a bound on $\hat{x}_i - x_i$ by using Chernoff bounds to conclude that if each
of $R$ symmetric random variables has a $\sqrt{c/R}$ chance of being small, then the median has a
$1 - 2e^{-c/2}$ chance of being small.

The concentration result we prove is a bound of the form $\Pr[|X| \le \epsilon] \ge
\Omega(\epsilon)$, where $X$ has variance 1 and is a sum of independent random variables, each of
which is symmetric and zero with probability at least 1/2. Such a bound certainly holds in the
limit as $X$ converges to a Gaussian, but we need it to be true before $X$ converges. To see why
this is subtle, consider the sum of $n$ independent $\pm 1/\sqrt{n}$ variables. The Berry-Esseen
theorem gives our bound for $\epsilon > 1/\sqrt{n}$, but the bound is actually false for
$\epsilon < 1/\sqrt{n}$ when $n$ is odd. When $n$ is even, we can pair up the variables to get
$n/2$ independent $\{0, \pm 2/\sqrt{n}\}$ variables. These variables are zero with 1/2
probability, so our bound applies for arbitrarily small $\epsilon$. What distinguishes even $n$
from odd $n$?
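The odd/even contrast can be checked exactly with binomial coefficients. The following is a small illustrative sketch (the function name `prob_abs_at_most` is hypothetical): it computes $\Pr[|X| \le \epsilon]$ for the normalized sum of $n$ uniform signs at a scale $\epsilon$ below $1/\sqrt{n}$.

```python
from math import comb, sqrt


def prob_abs_at_most(n, eps):
    """Exact Pr[|X| <= eps] for X = (sum of n independent uniform +-1 signs) / sqrt(n)."""
    total = 0
    for plus in range(n + 1):
        value = (2 * plus - n) / sqrt(n)    # value of X when `plus` of the signs are +1
        if abs(value) <= eps:
            total += comb(n, plus)
    return total / 2 ** n


for n in (99, 100, 999, 1000):
    eps = 0.5 / sqrt(n)                     # a scale below 1/sqrt(n)
    print(n, eps, prob_abs_at_most(n, eps))
```

For odd $n$ the sum of signs is an odd integer, so $|X| \ge 1/\sqrt{n}$ and the probability is exactly zero; for even $n$ the atom at zero keeps the probability on the order of $1/\sqrt{n}$, which exceeds $\epsilon$ here.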

The key for our argument is that, for a symmetric random variable $X$ with at least 1/2
probability of being 0, the Fourier transform of $X$ is nonnegative. The Fourier transform of the
triangle filter $\max\{1 - |x|/\epsilon, 0\}$ is also nonnegative. We use the convolution theorem
to translate the expectation of the triangle filter into an integral in Fourier space, and then
use positivity to note that we can bound that integral over all Fourier space by the integral over
small frequencies. This we control directly by using the quadratic Taylor series approximation to
$\cos x$. Because a lower bound on the expectation of the triangle filter also gives a lower bound
on $\Pr[|X| \le \epsilon]$, this proves what we want.
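As a numerical illustration of the two facts used here, the sketch below evaluates the Fourier transform of a sum of independent symmetric variables on a grid and compares it with zero and with the quadratic Taylor lower bound. The `atoms` list is a made-up example, and `fourier_transform` is a hypothetical helper; the product form relies on independence.

```python
from math import cos, pi


def fourier_transform(t, atoms):
    """F(t) = E[cos(2 pi X t)] for X = sum of independent symmetric variables.

    `atoms` is a list of (p, a): the variable is 0 with probability p and
    +a or -a with probability (1 - p) / 2 each.
    """
    value = 1.0
    for p, a in atoms:
        value *= p + (1.0 - p) * cos(2.0 * pi * a * t)
    return value


atoms = [(0.5, 1.0), (0.75, 0.3), (0.9, 2.0)]          # hypothetical example, every p >= 1/2
variance = sum((1.0 - p) * a * a for p, a in atoms)    # E[X^2] of the sum
grid = [j / 1000.0 for j in range(-3000, 3001)]
values = [fourier_transform(t, atoms) for t in grid]
print(min(values))                                     # never negative when every p >= 1/2
```

Each factor is at least $2p - 1 \ge 0$, so the product is nonnegative, and $\cos z \ge 1 - z^2/2$ gives $F(t) \ge 1 - 2\pi^2 t^2\, \mathbb{E}[X^2]$ near the origin.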

Related Work. Count-Min and Count-Median [CM04] are similar algorithms to Count-Sketch
that get the weaker $\ell_\infty/\ell_1$ guarantee. Their sketch matrix is the same as
Count-Sketch's but with $s_u(i) = 1$ always. This means that the random variables involved in the
error are not symmetric, so the estimates do not have the additional concentration that we show in
this paper.

The application of heavy hitters algorithms to power law distributions was studied in [CM05].

2 Preliminaries 

Notation. We use $f \gtrsim g$ to denote $f = \Omega(g)$ and $f \lesssim g$ to denote $f = O(g)$.

In the statement of Theorem 4.1, $x_{\overline{[k]}}$ denotes the vector consisting of all but the
largest $k$ coordinates of $x$. More generally, we think of the coordinates of $x$ as being
sorted, $|x_1| \ge |x_2| \ge \cdots \ge |x_n|$. This is purely a notational convenience, possible
because Count-Sketch is invariant under permutation of coordinates.

Given a real-valued random variable $X$, its Fourier transform is the function

\[ \mathcal{F}(t) = \mathbb{E}[e^{-2\pi i X t}]. \]

In general $\mathcal{F}(t)$ is complex-valued. However, our random variables are all symmetric; in
this case $\mathcal{F}(t)$ is real-valued and equals $\mathbb{E}[\cos(2\pi X t)]$.
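The claim that symmetry makes the transform real can be verified directly on a small discrete example. The distribution `dist` below is hypothetical, and the sign convention $\mathbb{E}[e^{-2\pi i X t}]$ is an assumption; for a symmetric variable the opposite sign gives the same value.

```python
import cmath
from math import cos, pi

# Hypothetical symmetric distribution: value -> probability, with Pr[X = v] = Pr[X = -v].
dist = {-2.0: 0.1, -0.5: 0.2, 0.0: 0.4, 0.5: 0.2, 2.0: 0.1}


def fourier(t):
    """Complex Fourier transform E[exp(-2 pi i X t)] of the discrete distribution."""
    return sum(p * cmath.exp(-2j * pi * v * t) for v, p in dist.items())


def cosine_form(t):
    """The real form E[cos(2 pi X t)], valid when X is symmetric."""
    return sum(p * cos(2 * pi * v * t) for v, p in dist.items())


for t in (0.0, 0.1, 0.37, 1.25):
    print(t, fourier(t), cosine_form(t))
```

The imaginary parts cancel in pairs $\pm v$, leaving only the cosine terms.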

3 Concentration Lemmas



The following is the key lemma for our proof. 

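Lemma 3.1. Let $X$ be a symmetric, real-valued random variable with variance 1, and suppose that
its Fourier transform $\mathcal{F}(t)$ is nonnegative. Then, for $\epsilon \le 1$,
$\Pr[|X| \le \epsilon] \gtrsim \epsilon$.

Proof. Because $\cos x \ge 1 - \frac{1}{2}x^2$ holds for all $x \in \mathbb{R}$, we have

\[ \mathcal{F}(t) \ge \mathbb{E}[1 - \tfrac{1}{2}(2\pi X t)^2] = 1 - 2\pi^2 t^2 \quad \forall t \in \mathbb{R}. \]

In particular, $\mathcal{F}(t) \ge \frac{1}{2}$ for $t \in [-\frac{1}{2\pi}, \frac{1}{2\pi}]$.
Let $T_\epsilon(x)$ be the triangle filter

\[ T_\epsilon(x) = \max\{1 - |x|/\epsilon, 0\} \]

and recall the Fourier transform relation

\[ T_\epsilon(x) = \int_{-\infty}^{\infty} \frac{\sin^2(\pi t \epsilon)}{\epsilon \pi^2 t^2} \cos(2\pi x t)\, dt. \]

Using this relation and switching the order of integration,
$\mathbb{E}[T_\epsilon(X)] = \int_{-\infty}^{\infty} \frac{\sin^2(\pi t \epsilon)}{\epsilon \pi^2 t^2} \mathcal{F}(t)\, dt$.
The integrand is nonnegative, so we get a lower bound on $\mathbb{E}[T_\epsilon(X)]$ by
integrating only over the interval $[-\frac{1}{2\pi}, \frac{1}{2\pi}]$. On this interval we have
$\mathcal{F}(t) \ge \frac{1}{2}$ and, because $\epsilon < 2\pi$,
$\frac{\sin^2(\pi t \epsilon)}{\epsilon \pi^2 t^2}$ is bounded below by its value at
$t = 1/(2\pi)$. Putting this together, we find that

\[ \mathbb{E}[T_\epsilon(X)] \ge \frac{1}{\pi} \cdot \frac{1}{2} \cdot \frac{\sin^2(\epsilon/2)}{\epsilon/4} = \frac{2}{\pi} \cdot \frac{\sin^2(\epsilon/2)}{\epsilon}. \]

For $\epsilon \le 1$ we have $\frac{\sin^2(\epsilon/2)}{\epsilon} \gtrsim \epsilon$. Now noting
that $\Pr[|X| \le \epsilon] \ge \mathbb{E}[T_\epsilon(X)]$ completes the proof. □

Corollary 3.2. Let $\{X_i : i \in [n]\}$ be independent symmetric random variables such that
$\Pr[X_i = 0] \ge 1/2$ for each $i$. Set $X = \sum_{i=1}^n X_i$ and $\sigma^2 = \mathbb{E}[X^2]$.
For $\epsilon \le 1$, $\Pr[|X| \le \epsilon\sigma] \gtrsim \epsilon$.

Proof. For each $i \in [n]$, let $p_i = \Pr[X_i = 0]$. The Fourier transform of $X_i$ is
$\mathcal{F}_i(t) = p_i + (1 - p_i)\, \mathbb{E}[\cos(2\pi X_i t) \mid X_i \ne 0] \ge p_i + (1 - p_i)(-1)$.
Because $p_i \ge 1/2$, this is nonnegative. Now $X/\sigma$ is a symmetric random variable with
nonnegative Fourier transform $\prod_{i=1}^n \mathcal{F}_i(t/\sigma)$ and with variance
$\mathbb{E}[(X/\sigma)^2] = 1$; applying Lemma 3.1 to it gives the desired bound. □

Note that Lemma 3.1 is not true without the positivity assumption, and in particular Corollary
3.2 is not true when $\Pr[X_i = 0]$ is small. Indeed, it seems intuitive that we get strong
concentration around 0 as a consequence of the large probability of each individual variable
being 0. We also remark that there are analogs of Lemma 3.1 and Corollary 3.2 using only first
moment bounds. The proof is nearly identical, so we omit it.

We also need the following lemma for concentration of medians.

Lemma 3.3. Suppose $X_1, \ldots, X_t$ are independent symmetric random variables such that, for
some $r, c > 0$, we have $\Pr[|X_i| \le r] \ge \sqrt{c/t}$ for all $i \in [t]$. Then

\[ \Pr\left[ \left| \operatorname{median}_{i \in [t]} X_i \right| > r \right] \le 2e^{-c/2}. \]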

Proof. Let $E_i$ denote the indicator for the event that $X_i > r$. Then
$\Pr[E_i = 1] \le (1 - \sqrt{c/t})/2$, so $\mathbb{E}[\sum_{i=1}^t E_i] \le t/2 - \sqrt{ct}/2$.
The $E_i$ are independent, so by a Chernoff bound we have that

\[ \Pr\left[ \sum_{i=1}^t E_i \ge t/2 \right] \le e^{-2(\sqrt{ct}/2)^2/t} = e^{-c/2}. \]

The same bound applies to the event that at least $t/2$ of the $X_i$ are less than $-r$, and if
neither event occurs then the median is in the interval $[-r, r]$. □
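The one-sided step of this proof can be checked against an exact binomial tail. The sketch below is illustrative only: the values of `t` and `c` are arbitrary, `upper_tail` is a hypothetical helper, and it takes the extremal case in which $\Pr[|X_i| \le r]$ equals $\sqrt{c/t}$ exactly.

```python
from math import comb, exp, sqrt


def upper_tail(t, q, threshold):
    """Exact Pr[Binomial(t, q) >= threshold]."""
    return sum(comb(t, j) * q ** j * (1 - q) ** (t - j) for j in range(t + 1) if j >= threshold)


t, c = 40, 6.0                       # illustrative values with c <= t
p_small = sqrt(c / t)                # assumed Pr[|X_i| <= r], the extremal case of the hypothesis
q = (1 - p_small) / 2                # Pr[X_i > r], by symmetry
one_side = upper_tail(t, q, t / 2)   # Pr[at least t/2 of the X_i exceed r]
print(one_side, exp(-c / 2))
```

Doubling `one_side` accounts for the symmetric event on the negative side, which is where the factor 2 in the lemma comes from.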



4 Count-Sketch



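Theorem 4.1. Consider the estimate $\hat{x}$ of $x$ from Count-Sketch using $R$ rows and
$k \ge 2$ columns, with fully random hash functions. For any $t \le R$ and each index $i$,

\[ \Pr\left[ (\hat{x}_i - x_i)^2 > \frac{t}{R} \cdot \frac{\|x_{\overline{[k]}}\|_2^2}{k} \right] < 2e^{-\Omega(t)}. \]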



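Proof. Fix $i \in [n]$. For each row $u$ and coordinate $j \in [n]$, define

\[ X_{u,j} = \begin{cases} s_u(i)\, s_u(j)\, x_j & \text{if } h_u(j) = h_u(i), \\ 0 & \text{otherwise.} \end{cases} \]

For each row $u$, define

\[ H_u = \sum_{j \in [k] \setminus \{i\}} X_{u,j} \quad \text{and} \quad T_u = \sum_{j \in [n] \setminus ([k] \cup \{i\})} X_{u,j}. \]

Then, by definition,

\[ \hat{x}_i - x_i = \operatorname{median}_u \, (H_u + T_u). \]

Each random variable $X_{u,j}$ with $j \ne i$ is symmetric, equals 0 with probability
$1 - 1/k \ge 1/2$, and otherwise equals $\pm x_j$. Moreover, for each row $u$, the random
variables $\{X_{u,j}\}$ are independent. Since $\mathbb{E}[T_u^2] \le \|x_{\overline{[k]}}\|_2^2/k$,
Corollary 3.2 shows that, for $\epsilon \le 1$,

\[ \Pr\left[ |T_u| \le \epsilon \cdot \frac{\|x_{\overline{[k]}}\|_2}{\sqrt{k}} \right] \gtrsim \epsilon. \]

Furthermore, $H_u = 0$ with probability at least $(1 - 1/k)^k \ge 1/4$, i.e., with constant
probability. Since $H_u$ is independent of $T_u$, this means that

\[ \Pr\left[ |H_u + T_u| \le \epsilon \cdot \frac{\|x_{\overline{[k]}}\|_2}{\sqrt{k}} \right] \gtrsim \epsilon. \]

Therefore Lemma 3.3 implies

\[ \Pr\left[ |\hat{x}_i - x_i| > \epsilon \cdot \frac{\|x_{\overline{[k]}}\|_2}{\sqrt{k}} \right] \le 2e^{-\Omega(\epsilon^2 R)}. \]

Setting $\epsilon = \sqrt{t/R}$ yields the desired result. □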
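A small simulation can make the statement of Theorem 4.1 concrete. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the sizes, the seed, and the power-law input are arbitrary choices, and `count_sketch_errors` is a hypothetical helper that uses explicit fully random hash tables. It reports the fraction of coordinates whose squared error exceeds $(t/R)\,\|x_{\overline{[k]}}\|_2^2/k$ for several $t$; averaged over the randomness, that fraction equals the mean per-coordinate probability bounded by the theorem.

```python
import random
from statistics import median


def count_sketch_errors(x, rows, cols, rng):
    """Return the errors x_hat_i - x_i for one Count-Sketch with fresh random hashes."""
    n = len(x)
    h = [[rng.randrange(cols) for _ in range(n)] for _ in range(rows)]
    s = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]
    y = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for i in range(n):
            y[u][h[u][i]] += s[u][i] * x[i]
    return [median(s[u][i] * y[u][h[u][i]] for u in range(rows)) - x[i] for i in range(n)]


rng = random.Random(1)                          # fixed seed so the run is repeatable
n, k, rows = 2000, 20, 15                       # illustrative sizes; cols = k as in the theorem
x = [(i + 1) ** -1.0 for i in range(n)]         # power law, already sorted by magnitude
tail_sq = sum(v * v for v in x[k:])             # squared l2 norm of x outside its top k
errors = count_sketch_errors(x, rows, k, rng)

fractions = []
for t in (1, 2, 4, 8, rows):
    threshold = (t / rows) * tail_sq / k
    frac = sum(e * e > threshold for e in errors) / n
    fractions.append(frac)
    print(t, frac)
```

The thresholds grow with $t$, so the reported fractions can only decrease; how fast they decrease is what the $2e^{-\Omega(t)}$ bound describes, up to the constant hidden in the $\Omega$.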



As an application to $\ell_2$ reconstruction, we give the following theorem. Note that the
condition

\[ |x_k| - |x_{2k}| \gtrsim \|x_{\overline{[k]}}\|_2 / \sqrt{k} \]

is satisfied by any power law distribution $x_i = i^{-\alpha}$ with $\alpha > 0.5$.

Theorem 4.2. Suppose $|x_k| - |x_{2k}| \gtrsim \|x_{\overline{[k]}}\|_2/\sqrt{k}$. Let $\hat{x}$
be the result of Count-Sketch using $R = O(\log n)$ rows and $O(k)$ columns, with fully random
hash functions. Let $x^*$ be the vector containing the largest $k$ entries of $\hat{x}$. Then

\[ \|x - x^*\|_2^2 \le \left(1 + \frac{1}{\log n}\right) \|x_{\overline{[k]}}\|_2^2 \]

with 3/4 probability.

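Proof. Let the number of columns be $ck$ for some constant $c$. Let $S \subset [n]$ contain the
largest $k$ coordinates of $\hat{x}$. By the standard Count-Sketch bound we have with
$1 - n^{-\Theta(1)}$ probability that
$\|\hat{x} - x\|_\infty^2 \le \|x_{\overline{[k]}}\|_2^2/(ck) =: \mu^2$. Then for sufficiently
large $c$, $|\hat{x}_i| > |\hat{x}_j|$ for all $i \in [k]$ and $j \notin [2k]$, so
$S \subseteq [2k]$.

Since $S \subseteq [2k]$, we have

\[ \|\hat{x}_S - x\|_2^2 = \|\hat{x}_S - x_S\|_2^2 + \|x_{\overline{S}}\|_2^2 \le \|\hat{x}_{[2k]} - x_{[2k]}\|_2^2 + \|x_{\overline{[k]}}\|_2^2. \]

For each $i \in [2k]$, define $V_i = \min((\hat{x}_i - x_i)^2, \mu^2)$. We have by Theorem 4.1 and
$V_i \le \mu^2$ that

\[ \Pr\left[ V_i > \frac{t}{R} \mu^2 \right] \le 2e^{-\Omega(t)} \]

for all $t$. It follows that $\mathbb{E}[V_i] \lesssim \mu^2/R$, so
$\mathbb{E}[\sum_{i \in [2k]} V_i] \lesssim 2k\mu^2/R$. By Markov's inequality, with constant
(7/8, say) probability,

\[ \sum_{i \in [2k]} V_i \lesssim k\mu^2/R = \frac{1}{cR} \|x_{\overline{[k]}}\|_2^2 \lesssim \frac{1}{c \log n} \|x_{\overline{[k]}}\|_2^2. \]

Now, conditioned on $\|\hat{x} - x\|_\infty^2 \le \mu^2$, $V_i = (\hat{x}_i - x_i)^2$. Thus with at
least $7/8 - n^{-\Theta(1)} \ge 3/4$ probability, and for sufficiently large $c$, we have
$\|\hat{x}_{[2k]} - x_{[2k]}\|_2^2 \le \frac{1}{\log n} \|x_{\overline{[k]}}\|_2^2$, whence

\[ \|\hat{x}_S - x\|_2^2 \le \left(1 + \frac{1}{\log n}\right) \|x_{\overline{[k]}}\|_2^2. \] □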
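The gap hypothesis of Theorem 4.2 can be checked numerically for a power law. The sketch below is illustrative: `gap_ratio` is a hypothetical helper, and the values of $n$, $k$, and $\alpha$ are arbitrary. It computes the ratio of $|x_k| - |x_{2k}|$ to $\|x_{\overline{[k]}}\|_2/\sqrt{k}$ for $x_i = i^{-\alpha}$; the hypothesis asks that this ratio stay bounded below by a constant.

```python
from math import sqrt


def gap_ratio(n, k, alpha):
    """(|x_k| - |x_2k|) divided by ||x_tail||_2 / sqrt(k) for x_i = i**(-alpha), i = 1..n."""
    gap = k ** -alpha - (2 * k) ** -alpha
    tail_sq = sum(i ** (-2 * alpha) for i in range(k + 1, n + 1))
    return gap / (sqrt(tail_sq) / sqrt(k))


for k in (10, 100, 1000):
    print(k, gap_ratio(100_000, k, 1.0), gap_ratio(100_000, k, 0.75))
```

For $\alpha = 1$ the tail norm squared is about $1/k$, so the ratio settles near $1/2$ independently of $k$; for other $\alpha > 0.5$ the constant changes but does not vanish as $k$ grows.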